To Read List
- [x] Gavel: summarize Gavel's solution in a blog post (Stanford)
- [ ] TinyQuanta: cluster scheduling; an efficient microsecond-scale blind scheduling system (UCB)
- [ ] ExeGPT: scheduling for LLM inference (Hanyang University)
- [ ] Optimizing Speculative Decoding for Serving Large Language Models Using Goodput
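- [x] Llumnix: a dynamic scheduling system for LLM inference, with a strong advantage in handling tail latencies (Alibaba)
- [ ] Sia: an elastic scheduling system for DL jobs with hybrid parallelism (UCB)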
- [ ] Mooncake
A guide to LLM inference and performance
Dynamic Pruning
to be continued
Matrix Decomposition
to be continued
Large Model Sparsification
to be continued
KV Cache Quantization
to be continued
Speculative Decoding
A New Paradigm for Accelerating LLM Inference! A Recent Survey of Speculative Decoding - Zhihu